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PARALLEL IMPLEMENTATION OF THE DISCONTINUOUS GALERKIN METHOD 


ABDELKADER BAGGAGE HAROLD ATKINS*, AND DAVID KEYES^ 

Abstract, This paper describes a parallel implementation of the discontinuous Galerkin method. Dis- 
continuous Galerkin is a spatially compact method that retains its accuracy and robustness on non-smooth 
unstructured grids and is well suited for time dependent simulations. Several parallelization approaches are 
studied and evaluated. The most natural and symmetric of the approaches has been implemented in an 
object-oriented code used to simulate aeroacoustic scattering. The parallel implementation is MPI- based 
and has been tested on various parallel platforms such as the SGI Origin, IBM SP2, and clusters of SGI and 
Sun workstations. The scalability results presented for the SGI Origin show slightly superlinear speedup on 
a fixed-size problem due to cache effects. 

Key words, discontinuous Galerkin method, parallelization strategies, object oriented, unstructured 
grids, Euler equations, high-order accuracy 

Subject classification. Computer Science 

1. Motivation* The discontinuous Galerkin (DG) method is a robust and compact finite element 
projection method that provides a practical framework for the development of high-order accurate methods 
using unstructured grids. The method is well suited for large-scale time-dependent computations in which 
high accuracy is required. An important distinction between the DG method and the usual finite-element 
method is that in the DG method the resulting equations are local to the generating element. The solution 
within each element is not reconstructed by looking to neighboring elements. Thus, each element may be 
thought of as a separate entity that merely needs to obtain some boundary data from its neighbors. The 
compact form of the DG method makes it well suited for parallel computer platforms. This compactness also 
allows a heterogeneous treatment of problems. That is, the element topology, the degree of approximation 
and even the choice of governing equations can vary from element to element and in time over the course of 
a calculation without loss of rigor in the method. 

Many of the method's accuracy and stability properties have been rigorously proven [1, 2, 3, 4, 5] for 
arbitrary element shapes, any number of spatial dimensions, and even for nonlinear problems, which lead to 
a very robust method. The DG method has been shown in mesh refinement studies [6] to be insensitive to the 
smoothness of the mesh. Its compact formulation can be applied near boundaries without special treatment, 
which greatly increases the robustness and accuracy of any boundary condition implementation. These 
features are crucial for the robust treatment of complex geometries. In semi-discrete form, the DG method 
can be combined with explicit time-marching methods, such as Runge-Kutta. One of the disadvantages 
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of the method is its high storage and high computational requirements; however, a recently developed 
quadrature-free implementation [6] has greatly ameliorated these concerns. 

Parallel implementations of the DG method have been performed by other investigators. Biswas, Devine, 
and Flaherty [7] applied a third-order quadrature- based DG method to a scalar wave equation on a NCUBE/2 
hypercube platform and reported a 97.57% parallel efficiency on 256 processors. Bey et al. [8] implemented a 
parallel /ip-adaptive DG method for hyperbolic conservation laws on structured grids. They obtained nearly 
optimal speedups when the ratio of interior elements to subdomain interface elements is sufficiently large. 
In both works, the grids were of a Cartesian type with cell sub-division in the latter case. 

The quadrature- free form of the DG method has been previously implemented and validated [6, 9, 10] 
in an object-oriented code for the prediction of aeroacoustic scattering from complex configurations. The 
code solves the unsteady linear Euler equations on a general unstructured mesh of mixed elements (squares 
and triangles) in two dimensions. The DG code developed by Atkins has been ported [11] to several parallel 
platforms using MPI. A detailed description of the numerical algorithm can be found in reference [6]; and the 
description of the code structure, parallelization routines and model objects can be found in reference [11]. 
In this work, three different parallelization approaches are described and efficiency results for the selected 
approach are reported. 

The next section provides a brief description of the numerical method and is followed by a discussion 
of parallelization strategies, a citation of our standard test case, and performance results of the code on the 
0rigin2000 and several other computing platforms. 

2. Discontinuous Galerkin Method. The DG method is readily applied to any equation of the form 

(2.1) (T + VF(t/)= 0. 

on a domain that has been divided into arbitrarily shaped nonoverlapping elements ft* that cover the domain. 
The DG method is defined by choosing a set of local basis functions B = {6/, 1 < l < N(p,d)} for each 
element, where N is a function of the local polynomial order p and the number of space dimensions d, and 
approximating the solution in the element in terms of the basis set 

N(p>d) 

(2.2) C/q,« V' = Vij bi . 

l=i 

The governing equation is projected onto each member of the basis set and cast in a weak form to give 

(2.3) [ b k .^dn- f Vb k F(V t )dQ + f b k F H (Vi,Vj) ■ n,jds = 0, 

Jn, ut j Qi Jan.j 

where V t is the approximate solution in element Vj denotes the approximate solution in a neighboring 
element ilj, diljj is the segment of the element boundary that is common to the neighboring element f lj, 
Uij is the unit outward-normal vector on and V and Vj denote the trace of the solutions on dQ rj . 

The coefficients of the approximate solution Vij are the new unknowns, and the local integral projection 
generates a set of equations governing these unknowns. The trace quantities are expressed in terms of 
a lower dimensional basis set bi associated with cft2 tj . F R denotes a numerical flux which is usually an 
approximate Riemann flux of the Lax- Friedrichs type. 

Because each element has a distinct local approximate solution, the solution on each interior edge is 
double valued and discontinuous. The approximate Riemann flux F R (V t , Vj) resolves the discontinuity and 
provides the only mechanism by which adjacent elements communicate. The fact that this communication 
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occurs in an edge integral means the solution in a given element V t depends only on the edge trace of the 
neighboring solution Vj, not on the whole of the neighboring solution V 3 . Also, because the approximate 
solution within each element is stored as a function, the edge trace of the solution is obtained without 
additional approximations. 

The DG method is efficiently implemented on general unstructured grids to any order of accuracy using 
the quadrature-free formulation. In the quadrature- free formulation, developed by Atkins and Shu in [6], 
the flux vector F is approximated in terms of the basis set hi , and the approximate Riemann flux F R is 
approximated in terms of the lower basis set hi: 


(2.4) 


F(Vi 


N(p,d) 

E 

i = i 


N(pM-l) 


F R {Vi, Vj) ■ n, = E 


1 = 1 


With these api>roximations, the volume and boundary integrals can be evaluated analytically, instead 
of by quadrature, leading to a simple sequence of matrix-vector operations 

d[Vij] l » v r * , V— v r-tf 


(2.5) 

where 

( 2 . 6 ) 


dt 


= (M _l A) [/,;.(] - E( M_,B «)[/«.ij. 

{ J } 


M 


[ b k bi dil 

A = 

[ Vbk b t (in 

, B,j = / b k bi ds 

Un 


Jn 



The matrices M, A, and B r j depend only on the shape of the similarity element and the degree of the 
solution p. Thus, the set of matrices associated with a particular similarity element can be precomputed 
and applied to all elements that map to it at a considerable savings of both storage and computation. The 
residual of equation (2.5) is evaluated by the following sequence of operations: 


[%<] 

= Ty K<] 1 

= F(Vj) ■ Hij j 


Vfti, 

\1U 

= F R (Vi,Vj) 


vdn ?J 

[/,/] 

= F(Vi) 

V 


dt 

= (M- 1 A) [/<.,] 

-E(M- , B, J )[7j ; ] 

W 

> VSh, 


where T ^ is the trace operator, and [() i ■ J denotes a vector containing the coefficients of an edge quantity 
on edge j. Edges will be referred to as interior edges or boundary edges when it is necessary to distinguish 
between the two. 


3. Parallel Computation. In this section, three different possible parallelization strategies for the 
DG method are described. The first approach is symmetric and easy to implement but results in redundant 
flux calculations. The second and third approaches eliminate the redundant flux calculations; however, the 
communication occurs in two stages making it more difficult to overlap with computation, and increasing 
the complexity of the implementation. The following notation will be used to describe the parallel imple- 
mentation. Let ft denote any element, let dft p denote any edge on the partition boundary, and dilj denote 
any other edge. The first approach is symmetric and is easily implemented in the serial code. It can be 
summarized as follows: 
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1. Compute [ttj,/] and [fjj] Vf2 

2. Send [vjj] and [fjj] on dil p to neighboring partitions 

3. Compute [//] and (M -1 A)[//] Vf2, and [/^] \/OQi 

4. Receive and [/ j f ] on dQ p from neighboring partitions 

5. Compute [/",] VcMi„ and (M-‘B, )[/*,] Vft 

In this approach, nearly all of the computation is scheduled to occur between the nonblocking send and 

pf 

receive; however, the edge flux [fjj] is doubly computed on all partition edges di} p . It is observed in actual 

pf 

computations that redundant calculation is not a significant factor. The calculation of [fijj] on all edges, 
dQ tJ , represents only 2% to 3% of the total CPU time, and the redundant computation is performed on only 
a fraction of these edges. 

The above sequence reflects the actual implementation [11] used to generate the results to be shown later; 
however, this approach offers the potential for further improvement. By collecting the elements into groups 
according to whether or not they are adjacent to a partition boundary, some of the work associated with the 
edge integral (ran also be performed between the send and receive. Let £2p denote any element adjacent to a 
partition boundary and f2/ denote any other element. The following sequence provides maximal overlap of 
communication and computation. 

1. Compute [vjj] and [fjj] V$2 ?) 

2. Send [vjj] and [fjj] on dQ p to neighboring partitions 

3. Compute [Vjj] and [fjj] VS2/ 

4. Compute [/,] and (M-'A)[/<] VO, and (/" ] Vdil, 

5. Compute (M-‘Bj)[7”] VSJ, 

(>. Receive [vjj] and [fjj] on dil p from neighboring partitions 
7. Compute [fjj] W)I2 P and [f^j] V£lp 


3.1. Other Parallelization Strategies. Two variations of an alternative parallelization strategy that 
eliminates the redundant flux calculations are described. In these approaches, the computation of the edge 
flux [fji] on a partition boundary is performed by one processor and the result is communicated to the 
neighboring processor. The processor that performs the flux calculation is said to “own” the edge. In the 
first variation, all edges shared by two processors are owned by only one of the two processors. In the second 
variation, ownership of the edges shared by two processors is divided equally between the two. Let df2p a) 
denote any edge owned by processor A, denote any edge owned by adjacent processor B, and <9f2p ab) 

denote any edge shared by processors A and B. For the purpose of illustration, let ownership of all d^ p b) 
in the first variation be given to processor A. Thus { df2p G ^ } n { } = { 0 } in both variations, and 

{ dQ { p b) }H{ <9f2p 6) } = { 0 } in the first variation. Both computations can be summarized in the following 
steps: 
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Process A 


Process B 


1. Compute [vjj] and [fjj] V Q p Compute [vjj] and [fjj] VTi ;j 

2. Send [Vjj] and [f jA ] VdQ p \dQp l) Send [vjj] and [f jA ] V di} p \dflp b) 


L L v 

3. Compute [vjj] and [fjj} 

4. Compute [//] V Q 

r \ r 

I VO, 

l J ** J L*' J f j 

Compute \vjj] and [.fjj] 
Compute [fi] \/il 

VS2/ 

5. Receive [vjj] and [fjj] 
G. Compute [fjj] 

7. Send [ffj] V d 

vyojf 1 

Receive [vjj] and [/_, ,] 
Compute [ffj] V <0 O}/’ 1 
Send [/J] VdQl b) 

V d Op'* 

8. Compute [ffj] VdQi 


Compute [fjj] 


9. Compute (M -1 A) [//] 

VO 

Compute (M _l A) [//] 

VO 

10. Compute (M _1 Bj) [/^ 

] vo, 

Compute (M _1 Bj) [/£, 

] vo, 

1 1 . Receive \f - { ] V c) ft p \ d flp 1 1 

Receive [fjj] \ dQp }) 

12. Compute (M l 

] vo,, 

Compute (M 'Bj) [/, , 

] vo p 


It is clear that under these strategies, there are no redundant flux calculations. In both variations, the 
total amount of data sent is actually less than that of the symmetric approach presented earlier. However, 
because the sends are performed in two stages, it is more difficult to overlap the communication with useful 
computation. Also, the unsyinmetrie form of edge ownership in the first variation introduces a difficulty in 
balancing the work associated with the edge flux calculation. In the second variation, the number of sends 
is twice that of the symmetric approach; however, because { } n { } — { 0 } in the first 

variation, the total number of sends in this approach is the same as in the symmetric approach presented 
earlier. 


4. Physical Problem. The parallel code is used to solve problems from the Second Benchmark Prol>- 
lems in Computational Aeroacoustics Workshop [12] held at Flordia State University in 1997. The physical 
problem is the scattering of acoustic waves and is well represented by the linearized Euler equations written 
in the form of equation (2.1). Details of the problem can be found in reference [12]. Figure (4.1. a) shows a 
typical partitioned mesh. 



(a) 



Fig. 4.1. Partitioned mesh ( a ), Performance on SP‘2 and workstations dusters (h) 



5. Results and Discussion. Performance tests have been conducted on the SGI 0rigin2000 and IBM 
SP2 platforms, and on clusters of workstations. The first test case applied a third-order method on a coarse 
mesh of only 800 elements. This small problem was run on the SP2 and two clusters of, resp., SGI and 




Fig. 5.1. Speedup for large domain on Origin 2000 (a). Computational Rate (b) 

Sun workstations in an FDDI network. The two clusters consisted of similar but not identical hardware. 
Also, the network was not dedicated to the cluster but carried other traffic. As seen in Figure (4.1. b), 
near linear speedup is obtained on all three machines when the partition size is sufficiently large (> 100 
element/processor). As the domain is divided over more processors, the number of elements assigned to each 
processor becomes too small and the communication overhead becomes apparent. The performance figures 
are far below ideal; however, four to eight distributed workstations still provide a valuable performance 
improvement. 

Three larger problems were used to evaluate the code on the 0rigin2000. For these cases a fifth-order 
method was used and the problem size was controlled bv varying the element size and by varying the location 
of the outer boundary. In the small and medium size problems, a slight superiinear speedup is obtained for 
some range of processors. This result is due to the improvement in cache performance that occurs when a 
fixed problem is divided into smaller parts as the number of processors is increased, and workingsets become 
cache- resident. The larger problem does not fit in cache in 128 processors and no superiinear speedup is 
observed; however, it is expected that performance will improve as the number of processors is increased. This 
is supported by Figure (5.1.b) which shows the computational rate as a function of the number of elements 
on each processor. The computational rate is defined as the average number of elements per processor times 
number of time steps divided by the wall clock time. (Note: because the calculation is time accurate, all 
processors are synchronized at each stage of the Runge-Kutta, and thus, the wall clock time of all processors 
is essentially the same.) Both the small and medium size problems show a 50% increase in the computational 
rate at about 2000 elements per processor. The computational rates for all three problem sizes are similar 
indicating good scalability. Thus, computational rate may be a better indicator of scalability than the usual 
“speedup” measure. 

6. Conclusions. Three parallelization strategies for the DG method have been described. A simple 
symmetric strategy introduces a redundant flux calculation, but maximizes the overlap of computation and 
communication. The redundant flux calculation can be eliminated by introducing a sequence of sends; 
however, with this approach it is more difficult to hide communication with computation, and it leads to a 
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complex implementation. In practice, the overhead due to the redundant flux calculation is negligible. The 
symmetric parallelization approach is implemented in an object-oriented computational aeroacoustics code 
using MPI. Performance results are presented for several distributed memory parallel platforms. The DG 
method provides a significant amount of computational work that is local to each element, this is due to 
the compact form of the method which can be effectively used to hide the communication overhead. The 
compact character of the method is exploited in the implementation. The parallel implementation gives 
superlinear speedup for large problems. This is attributed to the overlapping of computation as well as 
cache accelerations that occur when workingsets are cache 1 resident. 
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